Coronary heart disease occurs when plaque builds up, causing arteries to narrow and limiting the flow of oxygen-rich blood to the heart. It is common among U.S. population and can last for years or be lifelong.
The Behavioral Risk Factor Surveillance System (BRFSS) is a program administered by CDC. Since 1984, BRFSS has been collecting data about U.S. residents regarding their physical and mental health, past and present health conditions, and access to healthcare. Each year, a standard survey questionnaire is developed that includes questions about : demographic information; current health perceptions, conditions, and behaviors; substance use; and diet.
In this project, I will be using BRFSS survey data in 2019.
This project aims to predict whether a person is at risk for heart disease based on BRFSS survey data. Since the objective of BRFSS is to support monitoring and analyzing factors influencing public health in the U.S., it is of interest to examine if survey questions from BRFSS can be used for preventative health screening for heart disease.
set.seed(474)
library(tidyverse)
library(tidymodels)
library(corrplot)
library(dplyr)
library(ROSE)
library(xgboost)
library(kknn)
library(gridExtra)
setwd('/Users/zengspencer/pstat131')
load("/Users/zengspencer/pstat131/brfss_raw.rda")
As we can see below, the original data set contains 343 variables and 418,268 observations in total. Since many of them are not related to my project, I’ll select 17 predictors based on preliminary literature review and rename columns.
dim(brfss_raw)
## [1] 418268 343
# selected variable of interest
brfss_selected <- brfss_raw %>% select('CVDCRHD4', 'X_BMI5', 'SMOKE100', 'X_RFDRHV7','CVDSTRK3','MENTHLTH', 'PHYSHLTH','DIFFWALK','EXERANY2','DIABETE4','X_IMPRACE','X_AGEG5YR','BPHIGH4','TOLDHI2','CVDSTRK3','FRUIT2','FVGREEN1','INCOME2','SEXVAR')
# rename columns
brfss_selected <- brfss_selected %>% rename(
'heart_disease' = 'CVDCRHD4',
'bmi' = 'X_BMI5',
'heavy_smoker' = 'SMOKE100',
'heavy_alcohol' = 'X_RFDRHV7',
'stroke' = 'CVDSTRK3',
'mental_health' = 'MENTHLTH',
'physical_health' = 'PHYSHLTH',
'difficult_walk' = 'DIFFWALK',
'exercise' = 'EXERANY2',
'diabetes' = 'DIABETE4',
'race' = 'X_IMPRACE',
'age' = 'X_AGEG5YR',
'high_bp' = 'BPHIGH4',
'high_chol'= 'TOLDHI2',
'stroke'='CVDSTRK3',
'fruit' = 'FRUIT2',
'veggies' = 'FVGREEN1',
'income' = 'INCOME2',
'sex' = 'SEXVAR'
)
Below is a data frame of variables containing missing values. We can see bmi, veggies, fruit, high_chol, exercise, heavy_smoker, difficult_walk, and income have comparatively more missing values (greater than 100).
# check for missing values
missing_values <- brfss_selected %>%
is.na() %>%
colSums()
df_mval <- data.frame(missing_values)
df_mval
## missing_values
## heart_disease 8
## bmi 36203
## heavy_smoker 15991
## heavy_alcohol 0
## stroke 11
## mental_health 19
## physical_health 32
## difficult_walk 13762
## exercise 20839
## diabetes 9
## race 0
## age 0
## high_bp 4
## high_chol 24443
## fruit 31052
## veggies 33075
## income 6881
## sex 0
Since the data set is large with 418,268 observations in total, I’ll drop rows containing missing values.
# drop missing values
brfss <- brfss_selected %>%
drop_na()
We can see, after dropping, we still have 341,083 observations.
dim(brfss)
## [1] 341083 18
When I checked the code book provided by CDC, most categorical variables contain levels “don’t know” and “refused”. Since these levels do not provide useful information to my project, I’ll drop observations containing these levels.
brfss <- brfss %>% subset(
heart_disease != 7 & heart_disease !=9 & heavy_smoker != 7 & heavy_smoker != 9
& heavy_alcohol != 7 & heavy_alcohol != 9 & stroke != 7 & stroke != 9 & mental_health != 77 & mental_health != 99
& physical_health != 77 & physical_health != 99 & difficult_walk != 7 & difficult_walk != 9 &
diabetes != 7 & diabetes != 9 & age != 14 & high_bp != 7 & high_bp != 9 & exercise != 7 &
exercise != 9 & fruit != 777 & fruit != 999 & veggies != 777 & veggies != 999 & income !=
77 & income != 99 & high_chol != 7 & high_chol != 9)
# convert to factors
brfss <- brfss %>% mutate(
heart_disease = factor(heart_disease,levels = c(1,2)),
heavy_smoker = factor(heavy_smoker),
heavy_alcohol = factor(heavy_alcohol),
stroke = factor(stroke),
mental_health = factor(mental_health),
physical_health = factor(physical_health),
exercise = factor(exercise),
diabetes = factor(diabetes),
race = factor(race),
high_bp = factor(high_bp),
high_chol = factor(high_chol),
income = factor(income),
sex = factor(sex))
And we are left with 263,841 observations
dim(brfss)
## [1] 263841 18
As we can see below, the response variable is highly imbalanced, with the percent of ratio of the number of respondents with heart disease to the number of respondents without heart disease being around 6.3%.
summary(brfss$heart_disease)
## 1 2
## 15717 248124
Since the data set is still pretty large, I decided to balance it by undersampling, keeping all of the data in the minority class and randomly deleting observations in the majority class until there are equal number of observations in each class.
# balance the data
brfss_balanced <- ovun.sample(heart_disease~., data=brfss, p=0.5, seed = 474,
method="under")$data
After undersampling, I obtained a dataset with 31,394 observations.
dim(brfss_balanced)
## [1] 31394 18
Below is the table of variable description:
| Name | Variable Description | Type |
|---|---|---|
| heart_disease | (Ever told) you had angina or coronary heart disease? | categorical |
| bmi | Body Mass Index | numeric |
| heavy_smoker | Have you smoked at least 100 cigarettes in your entire life? | categorical |
| heavy_alcohol | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) | categorical |
| stroke | (Ever told) you had a stroke. | categorical |
| mental_health | For how many days during the past 30 days was your mental health not good? | categorical |
| physical_health | For how many days during the past 30 days was your physical health not good | categorical |
| difficult_walk | Do you have serious difficulty walking or climbing stairs? | categorical |
| exercise | During the past month, other than your regular job, did you participate in any physical activities or exercises? | categorical |
| diabetes | (Ever told) (you had) diabetes? | categorical |
| race | Race category | categorical |
| age | Fourteen-level age category | categorical |
| high_bp | (Ever told) (you had) high blood pressure? | categorical |
| high_chol | (Ever told) you had) high blood cholesterol? | categorical |
| fruit | Not including juices, how often did you eat fruit? | categorical |
| veggies | How often did you eat a green leafy or lettuce salad, with or without other vegetables? | categorical |
| income | Annual household income category | categorical |
| sex | Sex of respondent | categorical |
Below is the first six rows of the tidied dataset:
head(brfss_balanced)
## heart_disease bmi heavy_smoker heavy_alcohol stroke mental_health
## 1 2 3859 1 1 2 15
## 2 2 2658 2 1 2 88
## 3 2 2193 2 1 2 88
## 4 2 2943 2 1 2 88
## 5 2 2995 1 1 2 30
## 6 2 3196 2 1 2 88
## physical_health difficult_walk exercise diabetes race age high_bp high_chol
## 1 88 2 2 3 2 3 3 2
## 2 88 2 1 3 1 10 3 1
## 3 88 2 1 3 1 11 3 2
## 4 88 2 1 3 1 12 3 2
## 5 2 2 2 3 1 5 3 1
## 6 3 2 1 3 2 12 1 2
## fruit veggies income sex
## 1 203 202 4 2
## 2 201 205 6 1
## 3 101 203 3 1
## 4 206 204 8 1
## 5 202 302 5 1
## 6 202 205 7 2
I will split the dataset into a training set and a testing set. The proportion of the training set to the whole dataset is 0.7.
brfss_split <- brfss_balanced %>%
initial_split(prop = 0.7, strata = "heart_disease")
brfss_train <- training(brfss_split)
brfss_test <- testing(brfss_split)
dim(brfss_train)
## [1] 21974 18
dim(brfss_test)
## [1] 9420 18
Before model building, let’s explore some correlations within our training set.
I will first plot a barplot of our response variable heart_disease.
brfss_train %>%
ggplot(aes(x = heart_disease)) +
geom_bar(fill='lightblue') +
labs(title = 'Distribution of heart disease') +
theme_minimal()
We can see there are equal number of observations in each class since I have balanced the dataset.
I will explore the relationship between heart disease and demographic risk factors including age, sex, race, and income.
plt_sex <-brfss_train %>%
ggplot(aes(x = heart_disease, group = factor(sex) ,fill=factor(sex))) +
geom_bar(position = position_stack()) +
labs(y = "Cases", x = "Heart Disease", title = 'Barplot of heart disease by sex') +
scale_fill_discrete(labels=c('Male', 'Female')) +
guides(fill=guide_legend(title="Sex")) +
theme_minimal()
plt_sex
From this stacked barplot, we can see males are more prone to heart disease compared with females.
labels <- as_labeller(c('1' = 'White','2' = 'Black','3' = 'Asian','4' = 'American Indian','5' = 'Hispanic','6' = 'Other'))
plt_race <- brfss_train %>%
ggplot(aes(x = race, group = factor(heart_disease) ,fill=factor(heart_disease))) +
geom_bar(position = position_stack()) +
facet_wrap(~race,scales = "free",labeller = labels) +
guides(fill=guide_legend(title="Heart Disease")) +
scale_fill_discrete(labels=c('No', 'Yes')) +
labs(title = 'Barplot of heart disease from each race ') +
theme_minimal()
plt_race
The barplot of heart disease from each race category shows that American Indian is most prone to heart disease.
plt_age <- brfss_train %>%
ggplot(aes(x = age, group = factor(heart_disease) ,fill=factor(heart_disease))) +
geom_bar(position = position_stack()) +
scale_x_discrete(labels = c('18 to 24', '25 to 29', '30 to 34',
'35 to 39', '40 to 44', '45 to 49','50 to 59','60 to 64',
'65 to 69', '70 to 74', '75 to 79', '80+')) +
guides(fill=guide_legend(title="Heart Disease")) +
scale_fill_discrete(labels=c('No', 'Yes')) +
labs(y = "Cases", x = "Age groups", title = 'Barplot of heart disease by age') +
theme_minimal()
plt_age
In terms of age, the number of people having heart disease increases significantly as age increases.
plt_income <- brfss_train %>%
ggplot(aes(x = income, group = factor(heart_disease) ,fill=factor(heart_disease))) +
geom_bar(position = position_stack()) +
scale_x_discrete(labels = c('<10k', '10k to 15k', '15k to 20k',
'20k to 25k', '25k to 30k', '30k to 35k',
'35k to 40k','40k to 50k',
'50k to 75k', '75k+')) +
guides(fill=guide_legend(title='Heart Disease')) +
scale_fill_discrete(labels=c('No', 'Yes')) +
labs(y = "Cases", x = "Income Levels", title = 'Barplot of heart disease by income levels') +
theme_minimal()
plt_income
When considering income, low income groups tend to be associated with a higher proportion of people getting heart disease.
Below, I examine the relationship between heart disease and personal behaviors including heavy smoking, heavy alcohol consumption, mental health, physcial health, and bmi.
mental_health <- brfss_train %>% subset (mental_health != 88)
plt_mh <- mental_health %>%
ggplot(aes(x = heart_disease, group = factor(mental_health) ,fill=factor(mental_health))) +
geom_bar(position = position_stack()) +
guides(fill=guide_legend(title="Days mental health not good")) +
theme_minimal()
plt_ph <- brfss_train %>%
ggplot(aes(x = heart_disease, group = factor(physical_health) ,fill=factor(physical_health))) + geom_bar(position = position_stack()) +
guides(fill=guide_legend(title="Days physical health not good")) +
theme_minimal()
grid.arrange(plt_mh, plt_ph, ncol = 2)
In terms of mental health, the number of people having poor mental health over 20 days are higher among those with heart diseases; meanwhile, the number of people having no physical health issues during the past month is higher among those without heart diseases. These indicate that poor mental and physical health are associated with heart disease.
brfss_train %>%
ggplot(aes(x = heart_disease, group = factor(heavy_smoker) ,fill=factor(heavy_smoker))) +
geom_bar(position = position_stack()) +
guides(fill=guide_legend(title="heavy smoking")) +
scale_fill_discrete(labels=c('Yes', 'No')) +
labs(title = 'Barplot of heart disease by smoking status') +
theme_minimal()
The barplot of heart disease by heavy_smoker shows that people with heart disease smokes more frequently compared with those without .
brfss_train %>%
ggplot(aes(x = heart_disease, group = factor(heavy_alcohol) ,fill=factor(heavy_alcohol))) +
geom_bar(position = position_stack()) +
guides(fill=guide_legend(title="heavy drinker")) +
scale_fill_discrete(labels=c('No', 'Yes')) +
labs(title = 'Barplot of heart disease by alcohol consumption') +
theme_minimal()
Surprisingly, heavy alcohol consumption does not seem to significantly differ between those with heart diseases and those without.
ggplot(brfss_train, aes(x=bmi, color=heart_disease)) +
geom_density() +
theme_minimal()
The density curve of bmi by heart disease shows that people having heart disease tends to have higher weight.
I will set up a recipe to predict instances of heart disease, dummy coding categorical variables and normalizing all numerical variables.
brfss_recipe <- recipe(heart_disease ~., data = brfss_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_center(all_numeric_predictors()) %>%
step_scale(all_numeric_predictors())
I will randomly divides the training set into 5 groups of approximately equal size. Since I have already balanced the dataset before, I’ll not stratify the folds by response variable. These folds will be used for cross-validation, during which 1 fold will be used for assessment and the remaining 4 folds will be used for for analysis. This process will repeat 5 times, and a different validation set will be used each time.
brfss_folds <- vfold_cv(brfss_train, v = 5)
I will fit elastic net models, tuning penalty (amount of regularization) and mixture (proportion of lasso penalty) with 5 levels each
# set up model and workflow
elastic_net_spec <- multinom_reg(penalty = tune(),
mixture = tune()) %>%
set_mode("classification") %>%
set_engine("glmnet")
en_workflow <- workflow() %>%
add_recipe(brfss_recipe) %>%
add_model(elastic_net_spec)
# create a regular grid
en_grid <- grid_regular(penalty(range = c(-5, 5)),
mixture(range = c(0, 1)), levels = 5)
# fit on training set
tune_res_en <- tune_grid(
en_workflow,
resamples = brfss_folds,
grid = en_grid
)
After fitting models on the training set, I’ll select the model with the highest roc_auc and save the result to an RDA file for future model evaluation
# select the best model
best_model_en <- select_best(tune_res_en, metric = "roc_auc")
en_final <- finalize_workflow(en_workflow, best_model_en)
en_final_fit <- fit(en_final, data = brfss_train)
save(tune_res_en, en_final_fit, en_workflow,best_model_en, file = "/Users/zengspencer/pstat131/en.rda")
I will fit a k nearest neighbors, tuning the number of neighbors with 5 levels.
# set up model and workflow
knn_spec <-
nearest_neighbor(neighbors = tune()) %>%
set_mode("classification") %>%
set_engine("kknn")
# create grid
knn_workflow <- workflow() %>%
add_model(knn_spec) %>%
add_recipe(brfss_recipe)
# create a regular grid
knn_grid <- grid_regular(neighbors(range = c(1, 10)), levels = 5)
# fit on training set
tune_res_knn <- tune_grid(
knn_workflow,
resamples = brfss_folds,
grid = knn_grid
)
Similar to previous model, I will select and save the best model to an RDA fille.
# select the best model
best_model_knn <- select_best(tune_res_knn, metric = "roc_auc")
knn_final <- finalize_workflow(knn_workflow, best_model_knn)
knn_final_fit <- fit(knn_final, data = brfss_train)
save(tune_res_knn,knn_final_fit, knn_workflow, best_model_knn, file = "/Users/zengspencer/pstat131/knn.rda")
I will fit random forest, tuning trees, mtry (the number of randomly selected predictors), and min_n (minimal node size) with 5 levels each.
# set up model and workflow
rf_spec<-rand_forest() %>%
set_engine("ranger",importance="impurity")%>%
set_mode("classification")
rf_wf<-workflow()%>%
add_model(rf_spec %>% set_args(mtry=tune(),trees=tune(),min_n=tune()))%>%
add_formula(heart_disease~.)
# create grid
rf_grid <-grid_regular(
mtry(range= c(1,17)),
trees(range = c(50,200)),
min_n(range = c(10,30)),
levels = 5)
# fit on training set
tune_res_rf <-tune_grid(
rf_wf,
resample = brfss_folds,
grid = rf_grid,
metrics = metric_set(roc_auc)
)
After fitting models on the training set, I’ll select the model with the highest roc_auc and save the result to an RDA file for future model evaluation
# select the best model
best_model_rf <- select_best(tune_res_rf, metric = "roc_auc")
rf_final <- finalize_workflow(rf_wf, best_model_rf)
rf_final_fit <- fit(rf_final, data = brfss_train)
save(tune_res_rf,rf_final_fit, rf_wf, best_model_rf, file = "/Users/zengspencer/pstat131/rf.rda")
I will fit boosted trees, tuning mtry (the number of randomly selected predictors), min_n (minimal node size), and learning rate with 5 levels each
# set up model and workflow
boost_spec<-boost_tree()%>%
set_engine("xgboost")%>%
set_mode("classification")
boost_wf<-workflow()%>%
add_model(boost_spec %>% set_args(
learn_rate = tune(),
trees = tune())) %>%
add_formula(heart_disease~.)
# create a grid
boost_grid <- grid_regular(
learn_rate(range = c(-2,0.5)),
trees(range = c(10,1000)),
levels = 5)
# fit on training set
tune_res_boost <-tune_grid(
boost_wf,
resample = brfss_folds,
grid = boost_grid,
metrics = metric_set(roc_auc)
)
After fitting models on the training set, I’ll select the model with the highest roc_auc and save the result to an RDA file for future model evaluation
# select the best model
best_model_boost <- select_best(tune_res_boost, metric = "roc_auc")
boost_final <- finalize_workflow(boost_wf, best_model_boost)
boost_final_fit <- fit(boost_final, data = brfss_train)
# save result
save(tune_res_boost,boost_final_fit, boost_wf, best_model_boost,file = "/Users/zengspencer/pstat131/boost.rda")
# loading stored results
load("/Users/zengspencer/pstat131/boost.rda")
load("/Users/zengspencer/pstat131/rf.rda")
load("/Users/zengspencer/pstat131/knn.rda")
load("/Users/zengspencer/pstat131/en.rda")
The autoplot of result shows that smaller values of mixture (proportion of lasso penalty) and smaller values of penalty (amount of regularization) leads to higher accuracy and roc_auc.
autoplot(tune_res_en)
The autoplot of knn model shows that model performance increases as the number of neighbors increases.
autoplot(tune_res_knn)
The autoplot of random forest shows that the model perform best when the minimal node size and the number of trees are relatively large, and the number of predictors is relatively small (around 5).
autoplot(tune_res_rf)
The autoplot of boosted tree shows that model perform best when the learning rate is small and the number of tress is large.
autoplot(tune_res_boost)
Comparison of model accuracy shows that knn model with the number of trees equals 10 has the highest accuracy.
en_acc <- augment(en_final_fit, new_data = brfss_train) %>%
accuracy(truth = heart_disease, estimate = .pred_class)
knn_acc <- augment(knn_final_fit, new_data = brfss_train) %>%
accuracy(truth = heart_disease, estimate = .pred_class)
rf_acc <- augment(rf_final_fit, new_data = brfss_train) %>%
accuracy(truth = heart_disease, estimate = .pred_class)
boost_acc <- augment(boost_final_fit, new_data = brfss_train) %>%
accuracy(truth = heart_disease, estimate = .pred_class)
accuracies <- c(en_acc$.estimate, knn_acc$.estimate,
rf_acc$.estimate, boost_acc$.estimate)
models <- c("elastic net", "knn", "random forest", "boosted tree")
results <- tibble(accuracies = accuracies, models = models)
results %>%
arrange(-accuracies)
## # A tibble: 4 × 2
## accuracies models
## <dbl> <chr>
## 1 0.883 knn
## 2 0.830 random forest
## 3 0.796 boosted tree
## 4 0.767 elastic net
best_model_knn
## # A tibble: 1 × 2
## neighbors .config
## <int> <chr>
## 1 10 Preprocessor1_Model5
We’ve selected our best model — knn with 10 neighbours.
Let’s fit our selected model to the testing set and see the results.
knn_wf_tuned <- knn_workflow %>%
finalize_workflow(select_best(tune_res_boost, metric = "roc_auc"))
best_fit_boost <- fit(boost_wf_tuned, brfss_test)
save(best_fit_knn, file = "/Users/zengspencer/pstat131/best_fit_knn.rda")
load("/Users/zengspencer/pstat131/best_fit_knn.rda")
The roc curve is to the up left corner of the panel, suggesting that the model performs pretty well.
options(yardstick.event_first = FALSE)
# roc curve
augment(best_fit_knn, new_data = brfss_test) %>%
roc_curve(heart_disease, .pred_1) %>%
autoplot()
The roc_auc of knn on our testing set is around 0.966.
options(yardstick.event_first = FALSE)
augment(best_fit_knn, new_data = brfss_test) %>%
roc_auc(heart_disease, estimate = .pred_1) %>%
select(.estimate)
## # A tibble: 1 × 1
## .estimate
## <dbl>
## 1 0.966
options(yardstick.event_first = FALSE)
augment(best_fit_knn, brfss_test) %>%
conf_mat(truth = heart_disease, estimate = .pred_class) %>%
autoplot(type="heatmap")
In conclusion, k nearest neighbors performed the best while elastic net performed the worst. I’m a little bit surprised that random forest and boosted trees performed worse than knn. Overall, the result suggests that BRFSS’s survey questions is good for preventative health screening and monitoring heart disease. In terms of future analysis, if computing power allows, it may be useful to reexamine the technique used to balance the dataset and choose another one (i.e. oversampling) to see if model performance can be improved. Also, rather than performing feature selection based on literature review, I might try feature extraction to see if some other indicators are ignored. Finally, I would like to see if BRFSS survey questions can be used as indicators for other chronic health conditions, such as heart attack and stroke.